Take-home Exercise 3

Predicting HDB Public Housing Resale Prices using Geographically Weighted Methods

Published

March 13, 2023

Modified

March 26, 2023

1 Context

1.1 Problem Statement

To use factors affecting HDB Resale Flat Prices for 3-room flats from 1st January 2021 to 31st December 2022 to build 2 predictive models (using conventional OLS method and GWR method).

Then using the models created to predict the resale prices from January to February 2023.

Lastly to evaluate and compare the performance between the models.

1.2 Data

Data Format (Type) Description Source
Master Plan 2014 Subzone Boundary .shp
Spatial polygon data frame (Geospatial)
Singapore Boundary Shapefile In-class Exercise 9
HDB Resale Flat Prices .csv HDB Resale Flat Prices with property details Data.gov.sg
Childcare Spatial point data frame (Geospatial) Points of all the childcare centers in Singapore onemapsg API
Pre-schools Spatial point data frame (Geospatial) Points of all the pre-schools in Singapore onemapsg API
Kindergartens Spatial point data frame (Geospatial) Points of all the kindergartens in Singapore onemapsg API
Primary Schools Spatial point data frame (Geospatial) Points of all the primary schools in Singapore data.world (Hui Xiang Chua, 2017)
Parks Spatial point data frame (Geospatial) Points of all the parks in Singapore onemapsg API
Cemeteries Spatial point data frame (Geospatial) Points of all the active cemeteries in Singapore onemapsg API
Eldercare Facilities Spatial point data frame (Geospatial) Points of all the eldercare facilities in Singapore onemapsg API
CBD Spatial point data frame (Geospatial) Point of downtown core Senior’s Submission (Megan)
Shopping Malls Spatial point data frame (Geospatial) Points of all the shopping malls in Singapore Self-sourced
(Information from Wikipedia)
Bus Stops .csv (Geospatial) Location of all the bus stops in Singapore Land Transport DATAMALL (Feb 2023)
MRT Stations .csv (Geospatial) Location of all the MRT stations in Singapore data.world (Hui Xiang Xhua, 2020)

1.3 Packages

pacman::p_load(sf, tidyverse, onemapsgapi, tmap, units, Rfast,
               olsrr, corrplot, ggpubr, spdep, GWmodel, gtsummary, jsonlite, SpatialML, ranger)
  • sf: to handle spatial data

  • tidyverse: to handle attribute data

  • onemapsgapi: to extract geospatial data sets

  • tmap: to visualise geospatial data via choropleth mapping

  • units, Rfast: to calculate distances and facility counts

  • olsrr: to build OLS and perform diagnostic tests

  • corrplot: to plot for multivariate data visualisation and analysis (check for correlation between variables)

  • ggpubr: to plot and visualise graphs (for EDA)

  • spdep: to create spatial weight matrix objects

  • GWmodel: to calibrate geographical weighted family of models

  • gtsummary: to create elegant tables for model evaluation

2 Importing and Wrangling Geospatial Data

2.1 Master Plan 2019 Subzone Boundary

First, import the shapefile using the st_read() function.

mpsz <- st_read(
  dsn = "data/geospatial/boundary", layer = "MPSZ-2019"
)
Reading layer `MPSZ-2019' from data source 
  `D:\bellekwang\IS415\take_home_ex\THE3\data\geospatial\boundary' 
  using driver `ESRI Shapefile'
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
Geodetic CRS:  WGS 84

2.2 onemapsg API

With reference to the onemapsg API tutorial (Megan), I first looked through all the themes the API had using the search_themes() function.

all_themes <- search_themes(token)
all_themes

Looking through all the themes available, I’ve decided to extract the following themes that may affect the pricing of HDB Resale Prices:

  1. Childcares
  2. Pre-schools
  3. Kindergartens
  4. Hawker Centers
  5. Parks
  6. Cemeteries
  7. Eldercare Facilities

To extract the data, I will repeat the code below for each theme.

First, I call the api using the get_theme() function, then create it as a simple feature data frame using the st_as_sf() function. I will only be keeping the necessary columns: NAME and geometry.

childcare_tibble <- get_theme(token, "childcare")
childcare_sf <- st_as_sf(childcare_tibble, coords=c("Lng", "Lat"), crs = 4326)
childcare_sf <- select(childcare_sf, c('NAME', 'geometry'))
Hawker Centers
hawker_tibble <- get_theme(token, "hawkercentre_new")
hawker_sf <- st_as_sf(hawker_tibble, coords=c("Lng", "Lat"), crs = 4326)
hawker_sf <- select(hawker_sf, c('NAME', 'geometry'))
Parks
parks_tibble <- get_theme(token, "nationalparks")
parks_sf <- st_as_sf(parks_tibble, coords=c("Lng", "Lat"), crs = 4326)
parks_sf <- select(parks_sf, c('NAME', 'geometry'))
Preschools
preschool_tibble <- get_theme(token, "preschools_location")
preschl_sf <- st_as_sf(preschool_tibble, coords=c("Lng", "Lat"), crs = 4326)
preschl_sf <- select(preschl_sf, c('NAME', 'geometry'))
Cemeteries
cemetry_tibble <- get_theme(token, "activecemeteries")
cemetry_sf <- st_as_sf(cemetry_tibble, coords=c("Lng", "Lat"), crs = 4326)
cemetry_sf <- select(cemetry_sf, c('NAME', 'geometry'))
Eldercare Facilities
elderly_tibble <- get_theme(token, "eldercare")
eldercare_sf <- st_as_sf(elderly_tibble, coords=c("Lng", "Lat"), crs = 4326)
eldercare_sf <- select(eldercare_sf, c('NAME', 'geometry'))
Kindergarten
kindergartens_tibble <- get_theme(token, "kindergartens")
kindergartens_sf <- st_as_sf(kindergartens_tibble, coords=c("Lng", "Lat"), crs = 4326)
kindergartens_sf <- select(kindergartens_sf, c('NAME', 'geometry'))

2.3 Primary Schools

prischl <- read_csv("data/geospatial/prischl/primaryschoolsg.csv")
prischl_sf <-st_as_sf(prischl, coords=c("longitude", "latitude"), crs = 4326)
prischl_sf <- select(prischl_sf, c('name', 'geometry'))

2.4 CBD Location

With reference to senior’s submission (Megan), create a data frame with the coordinates of CBD (Downtown Core).

lat <- 1.287953
lng <- 103.851784

cbd_sf <- data.frame(lat, lng) %>%
  st_as_sf(coords = c("lng", "lat"), crs=4326) %>%
  st_transform(crs=3414)

2.5 Shopping Malls

Since there are no readily available and updated data set on all the shopping malls in Singapore, I gathered a list of all the malls as shown here.

mall_name = c('100 AM', '313@Somerset', 'Aperia', 'Balestier Hill Shopping Centre', 'Bugis Cube', 'Bugis Junction', 'Bugis+', 'Capitol Piazza', 'Cathay Cineleisure Orchard', 'The Centrepoint', 'City Square Mall', 'CityLink Mall', 'Duo', 'Far East Plaza', 'Funan', 'Great World City', 'HDB Hub', 'ION Orchard', 'Junction 8', 'Knightsbridge', 'Liat Towers', 'Lucky Plaza', 'Marina Bay Sands', 'The Shoppes at Marina Bay Sands', 'Marina Bay Link Mall', 'Marina Square', 'Millenia Walk', 'Ngee Ann City', 'Orchard Central', 'Orchard Gateway', 'Orchard Plaza', 'Midpoint Orchard', 'Palais Renaissance', "People's Park Centre", "People's Park Complex", 'Plaza Singapura', 'Raffles City', 'Scotts Square', 'Sim Lim Square', 'Singapore Shopping Centre', 'The South Beach', 'Square 2', 'Sunshine Plaza', 'Suntec City', 'Tanglin Mall', 'Tanjong Pagar Centre', 'Tekka Centre', 'The Adelphi', 'The Paragon', 'Tiong Bahru Plaza', 'The Poiz', 'Thomson Plaza', 'United Square', 'Thomson V', 'Velocity@Novena Square', 'Wheelock Place', 'Wisma Atria', 'Zhongshan Mall', 'i12 Katong', 'Parkway Parade', 'Paya Lebar Square', 'Paya Lebar Quarter', 'Roxy Square', 'Singpost Centre', 'Tampines 1', 'Tampines Mall', 'White Sands', 'City Plaza', 'Elias Mall', 'Loyang Point', 'Bedok Mall', 'Century Square', 'Our Tampines Hub', 'Changi City Point', 'Downtown East', 'Djitsun Mall Bedok', 'Eastpoint Mall', 'Jewel Changi Airport', 'KINEX', 'Katong Shopping Centre', 'Katong Square', 'Kallang Wave Mall', 'Leisure Park Kallang', 'Waterway Point', 'Compass One', 'Hougang Mall', '888 Plaza', 'Admiralty Place', 'AMK Hub', 'Canberra Plaza', 'Causeway Point', 'Woodlands Civic Centre', 'Broadway Plaza', 'Djitsun Mall', 'Jubilee Square', 'Junction 8', 'Junction Nine', 'Marsiling Mall', 'Northpoint City', 'Sembawang Shopping Centre', 'Sun Plaza', 'Vista Point', 'Wisteria Mall', 'Woodlands Mart', 'Woodlands North Plaza', 'Heartland Mall', 'NEX', 'Buangkok Square', 'Greenwich V', 'Hougang 1', 'Hougang Green Shopping Mall', 'Hougang Rivercourt', 'myVillage At Serangoon Garden', 'Northshore Plaza', 'Oasis Terraces', 'Punggol Plaza', 'Rivervale Mall', 'Rivervale Plaza', 'The Seletar Mall', 'Upper Serangoon Shopping Centre', '321 Clementi', 'The Clementi Mall', 'IMM', 'JCube', 'Jem', 'Beauty World Centre', 'Beauty World Plaza', 'Bukit Panjang Plaza', 'Bukit Timah Plaza', 'Fajar Shopping Centre', 'Greenridge Shopping Centre', 'Hillion Mall', 'HillV2', 'Junction 10', 'Keat Hong Shopping Centre', 'Limbang Shopping Centre', 'Lot One', 'Rail Mall', 'Sunshine Place', 'Teck Whye Shopping Centre', 'West Mall', 'Yew Tee Point', 'Yew Tee Square', 'VivoCity', 'HarbourFront Centre', 'Alexandra Retail Centre', 'Westgate', 'Jurong Point', 'Pioneer Mall', 'The Star Vista', 'Alexandra Central', 'Anchorpoint', 'Boon Lay Shopping Centre', 'Grantral Mall', 'Fairprice Hub', 'Gek Poh Shopping Centre', 'Rochester Mall', 'Taman Jurong Shopping Centre', 'West Coast Plaza', 'Queensway Shopping Centre')

Next, I will create a tibble with the mall_name and 2 coordinate columns.

malls_tibble <- data.frame(mall_name)
malls_tibble$LATITUDE <- 0
malls_tibble$LONGITUDE <- 0

With reference to senior’s submission (Megan), I will be using the onemapsg API to retrieve all the mall locations using the function below.

library(httr)
geocode <- function(mallname) {
  base_url <- "https://developers.onemap.sg/commonapi/search"
  query <- list("searchVal" = mallname, 
                "returnGeom" = "Y",
                "getAddrDetails" = "N",
                "pageNum" = "1")
  
  res <- GET(base_url, query = query)
  restext<-content(res, as="text")
  
  output <- fromJSON(restext)  %>% 
    as.data.frame %>%
    select(results.LATITUDE, results.LONGITUDE)

  return(output)
}

Run the function above to get the coordinates of the malls.

for (i in 1:nrow(malls_tibble)){
  temp_output <- geocode(malls_tibble[i, 1])
  
  malls_tibble$LATITUDE[i] <- temp_output$results.LATITUDE
  malls_tibble$LONGITUDE[i] <- temp_output$results.LONGITUDE
}

There were a few malls that the API was unable to search coordinates for so I will do them manually. These malls include: Clarke Quay Central, City Gate Mall, Holland Village Shopping Mall, Mustafa Shopping Centre, GR.iD, Shaw House and Centre, OD Mall.

to_add_malls <- data.frame(
 mall_name=c('Clarke Quay Central', 'City Gate Mall', 'Holland Village Shopping Mall', 'Mustafa Shopping Centre', 'GR.iD', 'Shaw House and Centre', 'OD Mall'),
 LATITUDE=c(1.288992, 1.3021799, 1.310988, 1.310074, 1.3002, 1.3035, 1.3380),
 LONGITUDE=c(103.846746, 103.8625, 103.795065, 103.855366, 103.8492, 103.8256, 103.7934)
)

Combine the new data frame into the overall tibble data frame and then convert it to a sf data frame.

malls_tibble <- rbind(malls_tibble, to_add_malls)
malls_sf <- st_as_sf(malls_tibble, coords = c("LONGITUDE", "LATITUDE"), crs = 4326)
glimpse(malls_sf)
Rows: 167
Columns: 2
$ mall_name <chr> "100 AM", "313@Somerset", "Aperia", "Balestier Hill Shopping…
$ geometry  <POINT [°]> POINT (103.8435 1.274683), POINT (103.8382 1.301007), …

2.6 MRT Stations

I was going to use the MRT station dataset provided by DataMall LTA but they had invalid geometries, thus I am using another dataset I found on data.world platform.

mrtstation <- read_csv("data/geospatial/mrtstation.csv")

I will first remove all the unnecessary columns and keeping only the station names and coordinates.

mrtstation <- select(mrtstation, c('STN_NAME', 'Latitude', 'Longitude'))

According to the MRT MAP website, this csv file is missing 11 MRT stations from the new Thomson-East Coast Line: TE4 Springleaf, TE5 Lentor, TE6 Mayflower, TE7 Bright Hill, TE8 Upper Thomson, TE12 Napier, TE13 Orchard Boulevard, TE15 Great World, TE16 Havelock, TE18 Maxwell, TE29 Shenton Way. I will be adding the 11 MRT stations and their coordinates manually.

mrtstations_to_add <- data.frame(
  STN_NAME=c("SPRINGLEAF MRT STATION", "LENTOR MRT STATION",
             "MAYFLOWER MRT STATION", "BRIGHT HILL  MRT STATION",
             "UPPER THOMSON  MRT STATION", "NAPIER MRT STATION",
             "ORCHARD BOULEVARD MRT STATION", "GREAT WORLD MRT STATION",
             "HAVELOCK MRT STATION", "MAXWELL MRT STATION",
             "SHENTON WAY MRT STATION"),
  Latitude=c(1.3978, 1.3846, 1.3724, 1.36384, 1.3542, 1.3069, 1.3024,
             1.295, 1.2877, 1.2803, 1.2775),
  Longitude=c(103.8182, 103.8368, 103.8372, 103.834748, 103.8332,
              103.8191, 103.8251, 103.833333, 103.8337, 103.844, 103.8507)
)

Combine the mrtstations_to_add via rbind() function.

mrtstation <- rbind(mrtstation, mrtstations_to_add)

Some MRT stations coexist in multiple lines (e.g. Bishan at the Northeast and Circle Line). They however have different coordinates so I will not be removing them from the data frame.

Lastly, transform the data frame into a spatial point data frame using the st_as_sf() function.

mrtstation_sf <- st_as_sf(mrtstation, coords = c("Longitude", "Latitude"), crs = 4326)
glimpse(mrtstation_sf)
Rows: 200
Columns: 2
$ STN_NAME <chr> "ADMIRALTY MRT STATION", "ALJUNIED MRT STATION", "ANG MO KIO …
$ geometry <POINT [°]> POINT (103.801 1.440585), POINT (103.8829 1.316433), PO…

2.7 Bus Stops

busstop_sf <- st_read(
  dsn = "data/geospatial/BusStop_Feb2023", layer = "BusStop"
)
busstop_sf <- select(busstop_sf, c('BUS_STOP_N', 'geometry'))

2.8 Transforming CRS Projection

mpsz <- st_transform(mpsz, 3414)
busstop_sf <- st_transform(busstop_sf, 3414)
cbd_sf <- st_transform(cbd_sf, 3414)
cemetry_sf <- st_transform(cemetry_sf, crs=3414)
childcare_sf <- st_transform(childcare_sf, crs=3414)
eldercare_sf <- st_transform(eldercare_sf, crs=3414)
hawker_sf <- st_transform(hawker_sf, crs=3414)
kindergartens_sf <- st_transform(kindergartens_sf, crs = 3414)
malls_sf <- st_transform(malls_sf, crs=3414)
mrtstation_sf <- st_transform(mrtstation_sf, crs=3414)
parks_sf <- st_transform(parks_sf, crs=3414)
preschl_sf <- st_transform(preschl_sf, crs=3414)
prischl_sf <- st_transform(prischl_sf, crs = 3414)

Now all the sf data frames are set to the correct CRS projection.

2.9 Checking for Missing Values

sum(is.na(busstop_sf))
sum(is.na(cemetry_sf))
sum(is.na(childcare_sf))
sum(is.na(eldercare_sf))
sum(is.na(hawker_sf))
sum(is.na(kindergartens_sf))
sum(is.na(malls_sf))
sum(is.na(mrtstation_sf))
sum(is.na(parks_sf))
sum(is.na(preschl_sf))
sum(is.na(prischl_sf))

3 Importing and Wrangling Asaptial Data

3.1 HDB Resale Flat Prices

hdb <- read_csv("data/aspatial/resale-flat-prices-2017.csv")
glimpse(hdb)

First, filter out rows for 3 room flats. Next, filter out training data from 1st January 2021 to 31st December 2022, and testing data from January to February 2023.

hdb_training <- hdb %>% 
  filter(flat_type == "3 ROOM") %>%
  filter(month >= "2021-01" & month <= "2022-12")

hdb_testing <- hdb %>% 
  filter(flat_type == "3 ROOM") %>%
  filter(month >= "2023-01" & month <= "2023-02")

Since there are only block and streetname available, we will need to retrieve the coordinates via onemapsg API. I will be using the method provided by the onemapsg API tutorial (by Megan)

First, replace the namings to line up with the naming onemapsg uses.

hdb_training$street_name <- gsub("ST\\.", "SAINT", hdb_training$street_name)
hdb_testing$street_name <- gsub("ST\\.", "SAINT", hdb_testing$street_name)

Next, create a function that gets the geocode from the block and streetname.

library(httr)
geocode <- function(block, streetname) {
  base_url <- "https://developers.onemap.sg/commonapi/search"
  address <- paste(block, streetname, sep = " ")
  query <- list("searchVal" = address, 
                "returnGeom" = "Y",
                "getAddrDetails" = "N",
                "pageNum" = "1")
  
  res <- GET(base_url, query = query)
  restext<-content(res, as="text")
  
  output <- fromJSON(restext)  %>% 
    as.data.frame %>%
    select(results.LATITUDE, results.LONGITUDE)

  return(output)
}

Finally, call the function on both the training and testing data.

hdb_training$LATITUDE <- 0
hdb_training$LONGITUDE <- 0

for (i in 1:nrow(hdb_training)){
  temp_output <- geocode(hdb_training[i, 4], hdb_training[i, 5])
  
  hdb_training$LATITUDE[i] <- temp_output$results.LATITUDE
  hdb_training$LONGITUDE[i] <- temp_output$results.LONGITUDE
}
hdb_testing$LATITUDE <- 0
hdb_testing$LONGITUDE <- 0

for (i in 1:nrow(hdb_testing)){
  temp_output <- geocode(hdb_testing[i, 4], hdb_testing[i, 5])
  
  hdb_testing$LATITUDE[i] <- temp_output$results.LATITUDE
  hdb_testing$LONGITUDE[i] <- temp_output$results.LONGITUDE
}

I will transform the data sets into sf dataframes using the st_as_sf() function.

hdb_testing_sf <- st_as_sf(hdb_testing, coords=c("LONGITUDE", "LATITUDE"), crs = 4326)
hdb_training_sf <- st_as_sf(hdb_training, coords=c("LONGITUDE", "LATITUDE"), crs = 4326)

Finally, transform the data frames to the correct CRS projection.

hdb_testing_sf <- st_transform(hdb_testing_sf, crs=3414)
hdb_training_sf <- st_transform(hdb_training_sf, crs = 3414)

3.2 Storey Range

Since storey range is categorical, with reference to senior’s submission (Megan), I will split it into binary values using the code below.

hdb_training_sf <- hdb_training_sf %>%
  pivot_wider(names_from = "storey_range", values_from = "storey_range", 
              values_fn = list(storey_range = ~1), values_fill = 0) 
hdb_testing_sf <- hdb_testing_sf %>%
  pivot_wider(names_from = "storey_range", values_from = "storey_range", 
              values_fn = list(storey_range = ~1), values_fill = 0) 

3.3 Proximity Calculation

With reference to senior’s submission (Megan), calculate the proximity between the nearest facilities and the HDB flats using the function as shown below.

proximity <- function(df1, df2, varname) {
  dist_matrix <- st_distance(df1, df2) %>%
    drop_units()
  df1[,varname] <- rowMins(dist_matrix)
  return(df1)
}

Run this function for the facilities that are going to be calculated.

hdb_training_sf <- 
  proximity(hdb_training_sf, cbd_sf, "PROX_CBD") %>%
  proximity(., eldercare_sf, "PROX_ELDERCARE") %>%
  proximity(., hawker_sf, "PROX_HAWKER") %>%
  proximity(., mrtstation_sf, "PROX_MRT") %>%
  proximity(., parks_sf, "PROX_PARK") %>%
  proximity(., cemetry_sf, "PROX_CEMETRY") %>%
  proximity(., malls_sf, "PROX_MALL")
hdb_testing_sf <- 
  proximity(hdb_testing_sf, cbd_sf, "PROX_CBD") %>%
  proximity(., eldercare_sf, "PROX_ELDERCARE") %>%
  proximity(., hawker_sf, "PROX_HAWKER") %>%
  proximity(., mrtstation_sf, "PROX_MRT") %>%
  proximity(., parks_sf, "PROX_PARK") %>%
  proximity(., cemetry_sf, "PROX_CEMETRY") %>%
  proximity(., malls_sf, "PROX_MALL")

For CBD, since it is one point only, the function needs to be modified accordingly.

proximity <- function(df1, df2, varname) {
  dist_matrix <- st_distance(df1, df2) %>%
    drop_units()
  df1[,varname] <- dist_matrix
  return(df1)
}
hdb_training_sf <- proximity(hdb_training_sf, cbd_sf, "PROX_CBD")
#hdb_testing_sf <- proximity(hdb_testing_sf, cbd_sf, "PROX_CBD")

3.4 Facility Count within Radius

To calculate the number of facilities within 350m radius of the HDB flats, we require a function as shown below. (Reference to senior’s submission by Megan)

Note

For primary school, I will be using a radius of 1km because it is more sparse as compared to the other factors.

num_radius <- function(df1, df2, varname, radius) {
  dist_matrix <- st_distance(df1, df2) %>%
    drop_units() %>%
    as.data.frame()
  df1[,varname] <- rowSums(dist_matrix <= radius)
  return(df1)
}

Run this function for the facilities that are going to be calculated.

hdb_training_sf <- 
  num_radius(hdb_training_sf, kindergartens_sf, "NUM_KDRGTN", 350) %>%
  num_radius(., childcare_sf, "NUM_CHILDCARE", 350) %>%
  num_radius(., busstop_sf, "NUM_BUS_STOP", 350) %>%
  num_radius(., preschl_sf, "NUM_PRESCHL", 350) %>%
  num_radius(., prischl_sf, "NUM_PRISCHL", 1000)
hdb_testing_sf <- 
  num_radius(hdb_testing_sf, kindergartens_sf, "NUM_KDRGTN", 350) %>%
  num_radius(., childcare_sf, "NUM_CHILDCARE", 350) %>%
  num_radius(., busstop_sf, "NUM_BUS_STOP", 350) %>%
  num_radius(., preschl_sf, "NUM_PRESCHL", 350) %>%
  num_radius(., prischl_sf, "NUM_PRISCHL", 1000)

3.5 Cleaning Up Data

3.5.1 Remaining Lease

The remaining lease is formatted in “x years y months” which will be hard to process. With reference to senior’s submission (Megan), convert the remaining lease simply to years.

str_list <- str_split(hdb_training_sf$remaining_lease, " ")

for (i in 1:length(str_list)) {
  if (length(unlist(str_list[i])) > 2) {
      year <- as.numeric(unlist(str_list[i])[1])
      month <- as.numeric(unlist(str_list[i])[3])
      hdb_training_sf$remaining_lease[i] <- year + round(month/12, 2)
  }
  else {
    year <- as.numeric(unlist(str_list[i])[1])
    hdb_training_sf$remaining_lease[i] <- year
  }
}
str_list <- str_split(hdb_testing_sf$remaining_lease, " ")

for (i in 1:length(str_list)) {
  if (length(unlist(str_list[i])) > 2) {
      year <- as.numeric(unlist(str_list[i])[1])
      month <- as.numeric(unlist(str_list[i])[3])
      hdb_testing_sf$remaining_lease[i] <- year + round(month/12, 2)
  }
  else {
    year <- as.numeric(unlist(str_list[i])[1])
    hdb_testing_sf$remaining_lease[i] <- year
  }
}

3.5.2 Renaming Columns

For ease of calling in future steps, rename and standardise the column names as such.

hdb_training_sf <- hdb_training_sf %>%
  mutate() %>%
  rename("AREA_SQM" = "floor_area_sqm", 
         "LEASE_YRS" = "remaining_lease", 
         "PRICE"= "resale_price") %>%
  relocate(`PRICE`)
hdb_testing_sf <- hdb_testing_sf %>%
  mutate() %>%
  rename("AREA_SQM" = "floor_area_sqm", 
         "LEASE_YRS" = "remaining_lease", 
         "PRICE"= "resale_price") %>%
  relocate(`PRICE`)

3.5.3 Removing Unnecessary Columns

Lastly, remove all the unnecessary columns to reduce the file size.

final_hdb_training_sf <- hdb_training_sf %>%
  select(1,7,10:38)
final_hdb_testing_sf <- hdb_testing_sf %>%
  select(1,7,10:36)

3.6 Saving Data Frames

To remove the need for running all the codes above, I will save the data frames as .shp files for ease for future use.

st_write(final_hdb_training_sf, "data/geospatial/training.shp")
st_write(final_hdb_testing_sf, "data/geospatial/testing.shp")

3.7 Future Importation

hdb_training_sf <- st_read(
  dsn = "data/geospatial", layer = "training"
)
hdb_training_sf <- hdb_training_sf %>%
  mutate() %>%
  rename("AREA_SQM" = "AREA_SQ", 
         "LEASE_YRS" = "LEASE_Y", 
         "PROX_CBD" = "PROX_CB", 
         "PROX_ELDERCARE" = "PROX_EL", 
         "PROX_HAWKER"= "PROX_HA",
         "PROX_MRT" = "PROX_MR", 
         "PROX_PARK" = "PROX_PA", 
         "PROX_CEMETRY"= "PROX_CE",
         "NUM_KDRGTN" = "NUM_KDR", 
         "NUM_CHILDCARE" = "NUM_CHI", 
         "NUM_BUSSTOP"= "NUM_BUS",
         "NUM_PRESCHL" = "NUM_PRE", 
         "NUM_PRISCHL" = "NUM_PRI")
cbd_sf <- st_transform(cbd_sf, 3414)
proximity <- function(df1, df2, varname) {
  dist_matrix <- st_distance(df1, df2) %>%
    drop_units()
  df1[,varname] <- dist_matrix
  return(df1)
}
hdb_training_sf <- proximity(hdb_training_sf, cbd_sf, "PROX_CBD")
mall_sf <- st_transform(malls_sf, 3414)
hdb_training_sf <- proximity(hdb_training_sf, malls_sf, "PROX_MALL")

4 Exploratory Data Analysis

hdb_training_sf <- read_rds("data/hdb_training_sf.rds")

4.1 Distribution of Variables

To understand the distribution of each variable, plot out multiple histograms using the ggarange() function.

PRICE <- ggplot(data=hdb_training_sf, aes(x= `PRICE`)) + 
  geom_histogram(bins=20, color="black", fill="light blue")

AREA_SQM <- ggplot(data=hdb_training_sf, aes(x= `AREA_SQM`)) + 
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_CBD <- ggplot(data=hdb_training_sf, aes(x= `PROX_CBD`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_ELDERCARE <- ggplot(data=hdb_training_sf, aes(x= `PROX_ELDERCARE`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_HAWKER <- ggplot(data=hdb_training_sf, aes(x= `PROX_HAWKER`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_MALL <- ggplot(data=hdb_training_sf, aes(x= `PROX_MALL`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_MRT <- ggplot(data=hdb_training_sf, aes(x= `PROX_MRT`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_CEMETRY <- ggplot(data=hdb_training_sf, aes(x= `PROX_CEMETRY`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_PARK <- ggplot(data=hdb_training_sf, aes(x= `PROX_PARK`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

NUM_KDRGTN <- ggplot(data=hdb_training_sf, aes(x= `NUM_KDRGTN`)) +
  geom_histogram(bins=10, color="black", fill="light blue")

NUM_CHILDCARE <- ggplot(data=hdb_training_sf, aes(x= `NUM_CHILDCARE`)) +
  geom_histogram(bins=10, color="black", fill="light blue")

NUM_BUSSTOP <- ggplot(data=hdb_training_sf, aes(x= `NUM_BUSSTOP`)) +
  geom_histogram(bins=10, color="black", fill="light blue")

NUM_PRESCHL <- ggplot(data=hdb_training_sf, aes(x= `NUM_PRESCHL`)) +
  geom_histogram(bins=10, color="black", fill="light blue")

NUM_PRISCHL <- ggplot(data=hdb_training_sf, aes(x= `NUM_PRISCHL`)) +
  geom_histogram(bins=10, color="black", fill="light blue")

ggarrange(PRICE, AREA_SQM, PROX_CBD, PROX_ELDERCARE, PROX_HAWKER, PROX_MALL,
          PROX_MRT, PROX_PARK, PROX_CEMETRY, NUM_KDRGTN,
          NUM_CHILDCARE, NUM_BUSSTOP, NUM_PRESCHL, NUM_PRISCHL,  
          ncol = 3, nrow = 4)
$`1`


$`2`


attr(,"class")
[1] "list"      "ggarrange"

As seen from the distribution above, most variables result in a normal distribution. Both PRICE and AREA_SQM are rather skewed so the value can be normalised using the log transformation.

hdb_training_sf <- hdb_training_sf %>%
  mutate(`LOG_PRICE` = log(PRICE))
hdb_training_sf <- hdb_training_sf %>%
  mutate(`LOG_AREA_SQM` = log(AREA_SQM))
LOG_PRICE <- ggplot(data=hdb_training_sf, aes(x= `LOG_PRICE`)) + 
  geom_histogram(bins=20, color="black", fill="light blue")

LOG_AREA_SQM <- ggplot(data=hdb_training_sf, aes(x= `LOG_AREA_SQM`)) + 
  geom_histogram(bins=20, color="black", fill="light blue")

ggarrange(LOG_PRICE, LOG_AREA_SQM,
          ncol = 2, nrow = 1)

Now the distribution is less skewed and better for processing the model.

4.2 Statistical Point Map

tmap_mode("view")
tm_shape(mpsz)+
  tm_polygons() +
tm_shape(hdb_training_sf) +  
  tm_dots(col = "PRICE",
          alpha = 0.6,
          style="quantile") +
  tm_view(set.zoom.limits = c(11,14)) +
tmap_options(check.and.fix = TRUE)

From this map, HDB 3-room flats with higher resale rates are concentrated mainly at the Sengkang - Punggol area and central area.

4.3 Variable Correlation

Before creating our models, it is important to understand if there are any strong correlations between variables as it can affect the accuracy of the model.

hdb_training_sf$LEASE_YRS <- as.numeric(hdb_training_sf$LEASE_YRS)
hdb_training_sf_noGeom <- hdb_training_sf %>%
  st_drop_geometry()
corrplot(cor(hdb_training_sf_noGeom[, 3:33]), diag = FALSE, order = "AOE",
         tl.pos = "td", tl.cex = 0.5, number.cex = 0.4, method = "number", type = "upper")

From the correlation model, we can see that there are 2 highly correlated variables: NUM_PRESCHL and NUM_CHILDCARE. There is also a slightly high correlated variable pair: NUM_PRESCHL and NUM_KDRGTN. Since both cases has NUM_PRESCHL as a common denominator, I will remove this variable from here onwards.

5 Hedonic Pricing Modelling

5.1 Final Training and Testing Data Set

final_training_sf <- hdb_training_sf %>%
  select(3,20:28,30:34)
hdb_testing_sf <- st_read(
  dsn = "data/geospatial", layer = "testing"
)
hdb_testing_sf <- proximity(hdb_testing_sf, cbd_sf, "PROX_CBD")
hdb_testing_sf <- proximity(hdb_testing_sf, malls_sf, "PROX_MALL")
hdb_testing_sf <- hdb_testing_sf %>%
  mutate() %>%
  rename("AREA_SQM" = "AREA_SQ", 
         "LEASE_YRS" = "LEASE_Y", 
         "PROX_ELDERCARE" = "PROX_EL", 
         "PROX_HAWKER"= "PROX_HA",
         "PROX_MRT" = "PROX_MR", 
         "PROX_PARK" = "PROX_PA", 
         "PROX_CEMETRY"= "PROX_CE",
         "NUM_KDRGTN" = "NUM_KDR", 
         "NUM_CHILDCARE" = "NUM_CHI", 
         "NUM_BUSSTOP"= "NUM_BUS",
         "NUM_PRESCHL" = "NUM_PRE", 
         "NUM_PRISCHL" = "NUM_PRI")
hdb_testing_sf <- hdb_testing_sf %>%
  mutate(`LOG_PRICE` = log(PRICE))
hdb_testing_sf <- hdb_testing_sf %>%
  mutate(`LOG_AREA_SQM` = log(AREA_SQM))
hdb_testing_sf$LEASE_YRS <- as.numeric(hdb_testing_sf$LEASE_YRS)
hdb_testing_sf_noGeom <- hdb_testing_sf %>%
  st_drop_geometry()
final_testing_sf <- hdb_testing_sf %>%
  select(3,19:33)
write_rds(final_testing_sf, "data/final_testing_sf.rds")
final_testing_sp <- as_Spatial(final_testing_sf)
final_testing_sf <- read_rds("data/final_testing_sf.rds")
final_training_sf <- read_rds("data/final_training_sf.rds")
final_testing_sf <- final_testing_sf %>%
  select(1:9,11:15)

5.2 Model 1: olsrr Method

5.2.1 Creating the MLR Model

olsrr_mlr <- lm(formula = LOG_PRICE ~ LOG_AREA_SQM + LEASE_YRS +
                   PROX_CBD + PROX_ELDERCARE  + PROX_HAWKER + PROX_MALL + PROX_MRT + PROX_PARK + PROX_CEMETRY + NUM_KDRGTN + NUM_CHILDCARE + NUM_BUSSTOP + NUM_PRISCHL,
                 data=final_training_sf)
ols_regress(olsrr_mlr)
                        Model Summary                         
-------------------------------------------------------------
R                       0.786       RMSE               0.128 
R-Squared               0.617       Coef. Var          1.003 
Adj. R-Squared          0.617       MSE                0.016 
Pred R-Squared          0.616       MAE                0.101 
-------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 

                                 ANOVA                                   
------------------------------------------------------------------------
               Sum of                                                   
              Squares           DF    Mean Square       F          Sig. 
------------------------------------------------------------------------
Regression    333.736           13         25.672    1558.612    0.0000 
Residual      206.860        12559          0.016                       
Total         540.597        12572                                      
------------------------------------------------------------------------

                                    Parameter Estimates                                      
--------------------------------------------------------------------------------------------
         model      Beta    Std. Error    Std. Beta       t        Sig      lower     upper 
--------------------------------------------------------------------------------------------
   (Intercept)     8.747         0.059                 147.923    0.000     8.631     8.863 
  LOG_AREA_SQM     0.871         0.014        0.358     63.141    0.000     0.844     0.898 
     LEASE_YRS     0.010         0.000        0.769    122.605    0.000     0.010     0.010 
      PROX_CBD     0.000         0.000       -0.510    -69.011    0.000     0.000     0.000 
PROX_ELDERCARE     0.000         0.000       -0.028     -4.880    0.000     0.000     0.000 
   PROX_HAWKER     0.000         0.000        0.006      1.006    0.315     0.000     0.000 
     PROX_MALL     0.000         0.000        0.014      1.543    0.123     0.000     0.000 
      PROX_MRT     0.000         0.000        0.017      2.900    0.004     0.000     0.000 
     PROX_PARK     0.000         0.000        0.065     10.806    0.000     0.000     0.000 
  PROX_CEMETRY    -0.009         0.001       -0.074     -8.693    0.000    -0.011    -0.007 
    NUM_KDRGTN     0.010         0.002        0.042      6.252    0.000     0.007     0.013 
 NUM_CHILDCARE    -0.003         0.001       -0.026     -3.868    0.000    -0.004    -0.001 
   NUM_BUSSTOP     0.003         0.000        0.032      5.537    0.000     0.002     0.003 
   NUM_PRISCHL     0.001         0.001        0.008      1.275    0.202    -0.001     0.003 
--------------------------------------------------------------------------------------------

5.2.2 gtsummary Table

tbl_regression(olsrr_mlr, 
               intercept = TRUE) %>% 
  add_glance_source_note(
    label = list(sigma ~ "\U03C3"),
    include = c(r.squared, adj.r.squared, 
                AIC, statistic,
                p.value, sigma))
Characteristic Beta 95% CI1 p-value
(Intercept) 8.7 8.6, 8.9 <0.001
LOG_AREA_SQM 0.87 0.84, 0.90 <0.001
LEASE_YRS 0.01 0.01, 0.01 <0.001
PROX_CBD 0.00 0.00, 0.00 <0.001
PROX_ELDERCARE 0.00 0.00, 0.00 <0.001
PROX_HAWKER 0.00 0.00, 0.00 0.3
PROX_MALL 0.00 0.00, 0.00 0.12
PROX_MRT 0.00 0.00, 0.00 0.004
PROX_PARK 0.00 0.00, 0.00 <0.001
PROX_CEMETRY -0.01 -0.01, -0.01 <0.001
NUM_KDRGTN 0.01 0.01, 0.01 <0.001
NUM_CHILDCARE 0.00 0.00, 0.00 <0.001
NUM_BUSSTOP 0.00 0.00, 0.00 <0.001
NUM_PRISCHL 0.00 0.00, 0.00 0.2
R² = 0.617; Adjusted R² = 0.617; AIC = -15,930; Statistic = 1,559; p-value = <0.001; σ = 0.128
1 CI = Confidence Interval

5.2.3 Checking Assumptions

It is important to check if the model violates any assumptions.

5.2.3.1 Multicolinearity

ols_vif_tol(olsrr_mlr)
        Variables Tolerance      VIF
1    LOG_AREA_SQM 0.9471019 1.055853
2       LEASE_YRS 0.7753874 1.289678
3        PROX_CBD 0.5576355 1.793286
4  PROX_ELDERCARE 0.9296981 1.075618
5     PROX_HAWKER 0.8423827 1.187109
6       PROX_MALL 0.3892681 2.568923
7        PROX_MRT 0.8459462 1.182108
8       PROX_PARK 0.8534316 1.171740
9    PROX_CEMETRY 0.4240868 2.358008
10     NUM_KDRGTN 0.6781597 1.474579
11  NUM_CHILDCARE 0.6563563 1.523563
12    NUM_BUSSTOP 0.9192414 1.087854
13    NUM_PRISCHL 0.7491127 1.334913

Variables X04TO06, X01TO03, X07TO09, X10TO12, X13TO15 and X16TO18 have high multicolinearity as they have VIF values significantly larger than 10.

5.2.3.2 Non-Linearity

ols_plot_resid_fit(olsrr_mlr)

Most scattered data points are around the 0 line. Relationship between the dependent variable and independent variables are linear.

5.2.3.3 Normality

ols_plot_resid_hist(olsrr_mlr)

The figure above shows that there is normal distribution in the MLR model.

5.2.3.4 Spatial Autocorrelation

mlr.output <- as.data.frame(olsrr_mlr$residuals)
final.training.res.sf <- cbind(final_training_sf, 
                        olsrr_mlr$residuals) %>%
rename(`MLR_RES` = `olsrr_mlr.residuals`)
final_training_sp <- as_Spatial(final.training.res.sf)
final_training_sp
class       : SpatialPointsDataFrame 
features    : 12573 
extent      : 11597.97, 45192.3, 28097.64, 48622.47  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
variables   : 15
names       : LEASE_YRS,         PROX_CBD, PROX_ELDERCARE, PROX_HAWKER, PROX_MRT, PROX_PARK, PROX_CEMETRY, NUM_KDRGTN, NUM_CHILDCARE, NUM_BUSSTOP, NUM_PRISCHL, PROX_MALL,        LOG_PRICE,     LOG_AREA_SQM,            MLR_RES 
min values  :     43.08, 721.958623503081,              1,           1,        1,         3,            1,          0,             0,           1,           0,         3, 12.2060726455302, 3.93182563272433, -0.510818871329572 
max values  :     97.42, 19650.0691667807,            133,         125,      199,       418,           10,          5,            13,          16,           7,       167, 14.0529514139793, 5.48479693349065,  0.740187765202039 
tm_shape(mpsz)+
  tmap_options(check.and.fix = TRUE) +
  tm_polygons(alpha = 0.4) +
tm_shape(final.training.res.sf) +  
  tm_dots(col = "MLR_RES",
          alpha = 0.6,
          style="quantile") +
  tm_view(set.zoom.limits = c(11,14))

There is sign of spatial autocorrelation. To further confirm this observation, perform the Moran’s I test.

nb <- dnearneigh(coordinates(final_training_sp), 0, 1500, longlat = FALSE)
nb_lw <- nb2listw(nb, style = 'W', zero.policy = TRUE)
lm.morantest(olsrr_mlr, nb_lw)

    Global Moran I for regression residuals

data:  
model: lm(formula = LOG_PRICE ~ LOG_AREA_SQM + LEASE_YRS + PROX_CBD +
PROX_ELDERCARE + PROX_HAWKER + PROX_MALL + PROX_MRT + PROX_PARK +
PROX_CEMETRY + NUM_KDRGTN + NUM_CHILDCARE + NUM_BUSSTOP + NUM_PRISCHL,
data = final_training_sf)
weights: nb_lw

Moran I statistic standard deviate = NA, p-value = NA
alternative hypothesis: greater
sample estimates:
Observed Moran I      Expectation         Variance 
              NA               NA               NA 

As such, I checked for NA values in the nb list. However there were none.

sum(is.na(nb))
[1] 0

5.3 Model 2: GWR with Fixed Bandwidth

5.3.1 Computing Fixed Bandwidth

bw.fixed <- bw.gwr(formula = LOG_PRICE ~ LOG_AREA_SQM + LEASE_YRS + 
                  PROX_CBD + PROX_ELDERCARE  + PROX_HAWKER + PROX_MALL + PROX_MRT + PROX_PARK + PROX_CEMETRY + NUM_KDRGTN + NUM_CHILDCARE + NUM_BUSSTOP + NUM_PRISCHL,
                 data=final_training_sp, 
                   approach="CV", 
                   kernel="gaussian", 
                   adaptive=FALSE, 
                   longlat=FALSE)
#write_rds(bw.fixed, "data/model/bw_fixed.rds")
bw.fixed <- read_rds("data/model/bw_fixed.rds")
bw.fixed
[1] 2714.882
gwr.fixed <- gwr.basic(formula = LOG_PRICE ~ LOG_AREA_SQM + LEASE_YRS + 
                  PROX_CBD + PROX_ELDERCARE  + PROX_HAWKER + PROX_MALL + PROX_MRT + PROX_PARK + PROX_CEMETRY + NUM_KDRGTN + NUM_CHILDCARE + NUM_BUSSTOP + NUM_PRISCHL,
                 data=final_training_sp,
                       bw=bw.fixed, 
                       kernel = 'gaussian', 
                       longlat = FALSE)
#write_rds(gwr.fixed, "data/model/gwr_fixed.rds")
gwr.fixed <- read_rds("data/model/gwr_fixed.rds")
gwr.fixed
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2023-03-26 00:54:43 
   Call:
   gwr.basic(formula = LOG_PRICE ~ LOG_AREA_SQM + LEASE_YRS + PROX_CBD + 
    PROX_ELDERCARE + PROX_HAWKER + PROX_MALL + PROX_MRT + PROX_PARK + 
    PROX_CEMETRY + NUM_KDRGTN + NUM_CHILDCARE + NUM_BUSSTOP + 
    NUM_PRISCHL, data = final_training_sp, bw = bw.fixed, kernel = "gaussian", 
    longlat = FALSE)

   Dependent (y) variable:  LOG_PRICE
   Independent variables:  LOG_AREA_SQM LEASE_YRS PROX_CBD PROX_ELDERCARE PROX_HAWKER PROX_MALL PROX_MRT PROX_PARK PROX_CEMETRY NUM_KDRGTN NUM_CHILDCARE NUM_BUSSTOP NUM_PRISCHL
   Number of data points: 12573
   ***********************************************************************
   *                    Results of Global Regression                     *
   ***********************************************************************

   Call:
    lm(formula = formula, data = data)

   Residuals:
     Min       1Q   Median       3Q      Max 
-0.51082 -0.08522 -0.00178  0.08261  0.74019 

   Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
   (Intercept)     8.747e+00  5.913e-02 147.923  < 2e-16 ***
   LOG_AREA_SQM    8.714e-01  1.380e-02  63.141  < 2e-16 ***
   LEASE_YRS       1.000e-02  8.158e-05 122.605  < 2e-16 ***
   PROX_CBD       -2.387e-05  3.459e-07 -69.011  < 2e-16 ***
   PROX_ELDERCARE -1.561e-04  3.198e-05  -4.880 1.07e-06 ***
   PROX_HAWKER     3.544e-05  3.524e-05   1.006  0.31460    
   PROX_MALL       6.929e-05  4.492e-05   1.543  0.12291    
   PROX_MRT        5.696e-05  1.964e-05   2.900  0.00374 ** 
   PROX_PARK       1.115e-04  1.032e-05  10.806  < 2e-16 ***
   PROX_CEMETRY   -8.677e-03  9.982e-04  -8.693  < 2e-16 ***
   NUM_KDRGTN      9.837e-03  1.574e-03   6.252 4.19e-10 ***
   NUM_CHILDCARE  -2.748e-03  7.105e-04  -3.868  0.00011 ***
   NUM_BUSSTOP     2.510e-03  4.533e-04   5.537 3.14e-08 ***
   NUM_PRISCHL     1.227e-03  9.624e-04   1.275  0.20221    

   ---Significance stars
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
   Residual standard error: 0.1283 on 12559 degrees of freedom
   Multiple R-squared: 0.6173
   Adjusted R-squared: 0.617 
   F-statistic:  1559 on 13 and 12559 DF,  p-value: < 2.2e-16 
   ***Extra Diagnostic information
   Residual sum of squares: 206.8603
   Sigma(hat): 0.1282784
   AIC:  -15929.99
   AICc:  -15929.95
   BIC:  -28249.81
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Fixed bandwidth: 2714.882 
   Regression points: the same locations as observations are used.
   Distance metric: Euclidean distance metric is used.

   ****************Summary of GWR coefficient estimates:******************
                         Min.     1st Qu.      Median     3rd Qu.    Max.
   Intercept       3.0909e+00  8.0651e+00  9.3452e+00  1.0326e+01 11.9255
   LOG_AREA_SQM    3.5974e-01  5.4312e-01  7.3788e-01  9.8309e-01  1.1350
   LEASE_YRS       5.9542e-03  8.1298e-03  9.8426e-03  1.2571e-02  0.0136
   PROX_CBD       -4.1569e-05 -2.9924e-05 -2.2437e-05 -1.3149e-05  0.0000
   PROX_ELDERCARE -5.7659e-04 -2.0793e-04 -1.1645e-04  3.3698e-05  0.0004
   PROX_HAWKER    -7.7823e-04 -3.3369e-04 -5.6847e-06  3.3596e-04  0.0007
   PROX_MALL      -5.0780e-03 -9.9860e-04 -7.6257e-05  4.5251e-04  0.0010
   PROX_MRT       -4.6867e-04 -9.8345e-05 -6.3679e-05 -2.0633e-05  0.0004
   PROX_PARK      -5.1145e-04 -5.4664e-05  1.0989e-04  1.4916e-04  0.0004
   PROX_CEMETRY   -4.8910e-02 -2.8086e-02 -1.1262e-02 -1.4587e-03  0.7320
   NUM_KDRGTN     -1.9520e-02 -7.3935e-03  1.5232e-03  1.6971e-02  0.0275
   NUM_CHILDCARE  -1.1003e-02 -5.6461e-03 -9.5256e-04  4.7756e-03  0.0098
   NUM_BUSSTOP    -5.0404e-03  1.8327e-03  3.2050e-03  6.9537e-03  0.0151
   NUM_PRISCHL    -2.0774e-02  6.2952e-04  2.8630e-03  5.8811e-03  0.0390
   ************************Diagnostic information*************************
   Number of data points: 12573 
   Effective number of parameters (2trace(S) - trace(S'S)): 130.9443 
   Effective degrees of freedom (n-2trace(S) + trace(S'S)): 12442.06 
   AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): -20328.3 
   AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): -20434.01 
   BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): -32146.26 
   Residual sum of squares: 143.7518 
   R-square value:  0.7340868 
   Adjusted R-square value:  0.731288 

   ***********************************************************************
   Program stops at: 2023-03-26 00:56:33 

5.3.2 Prediction

gwrFixed_pred <- gwr.predict(LOG_PRICE ~ LOG_AREA_SQM + LEASE_YRS + 
                  PROX_CBD + PROX_ELDERCARE  + PROX_HAWKER + PROX_MALL + PROX_MRT + PROX_PARK + PROX_CEMETRY + NUM_KDRGTN + NUM_CHILDCARE + NUM_BUSSTOP + NUM_PRISCHL,
                 data=final_training_sp,
                 predictdata = final_testing_sp,
                       bw=bw.fixed, 
                       kernel = 'gaussian', 
                       longlat = FALSE)

5.4 Model 3: GWR with Adaptive Bandwidth

bw.adaptive <- bw.gwr(formula = LOG_PRICE ~ LOG_AREA_SQM + LEASE_YRS + 
                  PROX_CBD + PROX_ELDERCARE  + PROX_HAWKER + PROX_MALL + PROX_MRT + PROX_PARK + PROX_CEMETRY + NUM_KDRGTN + NUM_CHILDCARE + NUM_BUSSTOP + NUM_PRISCHL,
                 data=final_training_sp,
                      approach="CV", 
                      kernel="gaussian", 
                      adaptive=TRUE, 
                      longlat=FALSE)
#write_rds(bw.adaptive, "data/model/bw_adaptive.rds")
bw.adaptive <- read_rds("data/model/bw_adaptive.rds")
bw.adaptive
[1] 1136
gwr.adaptive <- gwr.basic(formula = LOG_PRICE ~ LOG_AREA_SQM + LEASE_YRS + 
                  PROX_CBD + PROX_ELDERCARE  + PROX_HAWKER + PROX_MALL + PROX_MRT + PROX_PARK + PROX_CEMETRY + NUM_KDRGTN + NUM_CHILDCARE + NUM_BUSSTOP + NUM_PRISCHL,
                 data=final_training_sp,
                 bw=bw.adaptive, 
                          kernel = 'gaussian', 
                          adaptive=TRUE, 
                          longlat = FALSE)
#write_rds(gwr.adaptive, "data/model/gwr_adaptive.rds")
gwr.adaptive <- read_rds("data/model/gwr_adaptive.rds")
gwr.adaptive
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2023-03-26 01:42:23 
   Call:
   gwr.basic(formula = LOG_PRICE ~ LOG_AREA_SQM + LEASE_YRS + PROX_CBD + 
    PROX_ELDERCARE + PROX_HAWKER + PROX_MALL + PROX_MRT + PROX_PARK + 
    PROX_CEMETRY + NUM_KDRGTN + NUM_CHILDCARE + NUM_BUSSTOP + 
    NUM_PRISCHL, data = final_training_sp, bw = bw.adaptive, 
    kernel = "gaussian", adaptive = TRUE, longlat = FALSE)

   Dependent (y) variable:  LOG_PRICE
   Independent variables:  LOG_AREA_SQM LEASE_YRS PROX_CBD PROX_ELDERCARE PROX_HAWKER PROX_MALL PROX_MRT PROX_PARK PROX_CEMETRY NUM_KDRGTN NUM_CHILDCARE NUM_BUSSTOP NUM_PRISCHL
   Number of data points: 12573
   ***********************************************************************
   *                    Results of Global Regression                     *
   ***********************************************************************

   Call:
    lm(formula = formula, data = data)

   Residuals:
     Min       1Q   Median       3Q      Max 
-0.51082 -0.08522 -0.00178  0.08261  0.74019 

   Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
   (Intercept)     8.747e+00  5.913e-02 147.923  < 2e-16 ***
   LOG_AREA_SQM    8.714e-01  1.380e-02  63.141  < 2e-16 ***
   LEASE_YRS       1.000e-02  8.158e-05 122.605  < 2e-16 ***
   PROX_CBD       -2.387e-05  3.459e-07 -69.011  < 2e-16 ***
   PROX_ELDERCARE -1.561e-04  3.198e-05  -4.880 1.07e-06 ***
   PROX_HAWKER     3.544e-05  3.524e-05   1.006  0.31460    
   PROX_MALL       6.929e-05  4.492e-05   1.543  0.12291    
   PROX_MRT        5.696e-05  1.964e-05   2.900  0.00374 ** 
   PROX_PARK       1.115e-04  1.032e-05  10.806  < 2e-16 ***
   PROX_CEMETRY   -8.677e-03  9.982e-04  -8.693  < 2e-16 ***
   NUM_KDRGTN      9.837e-03  1.574e-03   6.252 4.19e-10 ***
   NUM_CHILDCARE  -2.748e-03  7.105e-04  -3.868  0.00011 ***
   NUM_BUSSTOP     2.510e-03  4.533e-04   5.537 3.14e-08 ***
   NUM_PRISCHL     1.227e-03  9.624e-04   1.275  0.20221    

   ---Significance stars
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
   Residual standard error: 0.1283 on 12559 degrees of freedom
   Multiple R-squared: 0.6173
   Adjusted R-squared: 0.617 
   F-statistic:  1559 on 13 and 12559 DF,  p-value: < 2.2e-16 
   ***Extra Diagnostic information
   Residual sum of squares: 206.8603
   Sigma(hat): 0.1282784
   AIC:  -15929.99
   AICc:  -15929.95
   BIC:  -28249.81
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Adaptive bandwidth: 1136 (number of nearest neighbours)
   Regression points: the same locations as observations are used.
   Distance metric: Euclidean distance metric is used.

   ****************Summary of GWR coefficient estimates:******************
                         Min.     1st Qu.      Median     3rd Qu.    Max.
   Intercept       6.1502e+00  8.1054e+00  9.3081e+00  9.9747e+00 10.7299
   LOG_AREA_SQM    4.6428e-01  6.2619e-01  7.6128e-01  9.7894e-01  1.2517
   LEASE_YRS       6.8669e-03  8.6759e-03  9.9274e-03  1.2394e-02  0.0137
   PROX_CBD       -3.9767e-05 -2.9481e-05 -2.4985e-05 -1.2530e-05  0.0000
   PROX_ELDERCARE -3.7688e-04 -2.3458e-04 -1.3274e-04  5.1452e-06  0.0002
   PROX_HAWKER    -7.9498e-04 -2.5034e-04  3.3952e-05  3.3260e-04  0.0007
   PROX_MALL      -2.5818e-03 -6.7107e-04 -1.4139e-05  4.7366e-04  0.0010
   PROX_MRT       -3.6988e-04 -1.1154e-04 -3.8094e-05  1.4233e-05  0.0002
   PROX_PARK      -4.6720e-04 -2.6649e-05  1.0632e-04  1.4536e-04  0.0002
   PROX_CEMETRY   -9.0603e-02 -2.9173e-02 -1.1126e-02 -4.3526e-03  0.3763
   NUM_KDRGTN     -1.8435e-02 -5.6958e-03  6.8287e-03  1.9178e-02  0.0298
   NUM_CHILDCARE  -1.2890e-02 -5.4920e-03 -2.0900e-03  3.6946e-03  0.0096
   NUM_BUSSTOP    -4.8973e-03  1.0591e-03  3.3391e-03  6.4189e-03  0.0118
   NUM_PRISCHL    -1.8117e-02  1.3327e-03  3.0800e-03  6.0728e-03  0.0302
   ************************Diagnostic information*************************
   Number of data points: 12573 
   Effective number of parameters (2trace(S) - trace(S'S)): 99.18713 
   Effective degrees of freedom (n-2trace(S) + trace(S'S)): 12473.81 
   AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): -19882.66 
   AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): -19961.55 
   BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): -31893.77 
   Residual sum of squares: 149.5661 
   R-square value:  0.7233314 
   Adjusted R-square value:  0.7211312 

   ***********************************************************************
   Program stops at: 2023-03-26 01:44:26 
gwrAdaptive_pred <- gwr.predict(LOG_PRICE ~ LOG_AREA_SQM + LEASE_YRS + 
                  PROX_CBD + PROX_ELDERCARE  + PROX_HAWKER + PROX_MALL + PROX_MRT + PROX_PARK + PROX_CEMETRY + NUM_KDRGTN + NUM_CHILDCARE + NUM_BUSSTOP + NUM_PRISCHL,
                 final_training_sp,
                 final_testing_sp,
                 bw.adaptive, 
                       kernel = 'gaussian', 
                       longlat = FALSE)

Similarly, it returned the same error that was said could be due to high multicolinearity between variables.

5.5 Model 4: Random Forest

5.5.1 Preparing Coordinate Data

final_testing_sf <- read_rds("data/final_testing_sf.rds")
final_testing_sp <- as_Spatial(final_testing_sf)
final_testing_sf <- final_testing_sf %>%
  select(1:9,11:15)

coords_train <- st_coordinates(final_training_sf)
coords_test <- st_coordinates(final_testing_sf)
train_data <- final_training_sf %>%
  st_drop_geometry()
write_rds(coords_train, "data/coords_train.rds")
write_rds(coords_test, "data/coords_test.rds")
write_rds(final_training_sf, "data/final_training_sf.rds")
write_rds(train_data, "data/train_data.rds")

train_data <- read_rds("data/train_data.rds")
coords_train <- read_rds("data/coords_train.rds")
set.seed(1234)
rf <- ranger(LOG_PRICE ~ LOG_AREA_SQM + LEASE_YRS + 
                  PROX_CBD + PROX_ELDERCARE  + PROX_HAWKER + PROX_MALL + PROX_MRT + PROX_PARK + PROX_CEMETRY + NUM_KDRGTN + NUM_CHILDCARE + NUM_BUSSTOP + NUM_PRISCHL,
             data=train_data, num.trees = 500)
print(rf)
Ranger result

Call:
 ranger(LOG_PRICE ~ LOG_AREA_SQM + LEASE_YRS + PROX_CBD + PROX_ELDERCARE +      PROX_HAWKER + PROX_MALL + PROX_MRT + PROX_PARK + PROX_CEMETRY +      NUM_KDRGTN + NUM_CHILDCARE + NUM_BUSSTOP + NUM_PRISCHL, data = train_data,      num.trees = 500) 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      12573 
Number of independent variables:  13 
Mtry:                             3 
Target node size:                 5 
Variable importance mode:         none 
Splitrule:                        variance 
OOB prediction error (MSE):       0.005299921 
R squared (OOB):                  0.8767462 

5.5.2 Calibrating Geographical RF Model

set.seed(1234)
gwRF_adaptive <- grf(formula = LOG_PRICE ~ LOG_AREA_SQM + LEASE_YRS + 
                  PROX_CBD + PROX_ELDERCARE  + PROX_HAWKER + PROX_MALL + PROX_MRT + PROX_PARK + PROX_CEMETRY + NUM_KDRGTN + NUM_CHILDCARE + NUM_BUSSTOP + NUM_PRISCHL,
                     dframe=train_data, 
                     bw=55,
                     kernel="adaptive",
                     coords=coords_train, ntree = 10)

6 Evaluation and Conclusion

Model Adjusted R-square
1. olsrr MLR 0.617
2. GWR with Fixed Bandwidth 0.731288
3. GWR with Adaptive Bandwith 0.7211312
4. Random Forest (Ranger) 0.8767462

According to the adjusted R-square values, the best model would be Random forest using the ranger package.

Unfortunately, due to the lack of understanding and computational limitations, there was no prediction model that I could generate predictions out of. Will try to understand it better after this take home exercise!